Embedding Multilingual and Relational Data Using Linear Mappings
This thesis presents our research on embedding methods, a family of machine learning techniques that encode real-world signals into high-dimensional vectors. Specifically, we focus on algorithms whose backbone is one simple yet elegant algebraic operation: the linear mapping, also known as a linear transformation or vector space homomorphism. Past studies have shown the usefulness of these approaches for modelling complex data, such as lexicons from different languages and networks storing factual relations. However, they also exhibit crucial limitations, including a lack of theoretical justification, precision drops in challenging setups, and a considerable environmental impact during training.
To bridge these gaps, we first identify the previously unnoticed link between the success of linear Cross-Lingual Word Embedding (CLWE) mappings and the preservation of the implicit analogy relation, using both theoretical and empirical evidence. Next, we propose a post-hoc L1-norm rotation step which substantially improves the performance of existing CLWE mappings. Then, beyond conventional setups involving only modern languages, we extend the application of CLWE mappings to summarising lengthy and opaque historical text. Finally, motivated by the learning procedure of CLWE models, we adopt linear mappings to optimise Knowledge Graph Embeddings (KGEs) iteratively, significantly reducing the carbon footprint of training.
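As a rough illustration of the operation the thesis revolves around: a linear CLWE mapping can be fitted from a seed dictionary by ordinary least squares. The sketch below uses synthetic vectors in place of real embeddings; all names and sizes are ours, not the thesis's.

```python
import numpy as np

def fit_linear_clwe_map(X, Y):
    """Fit a linear map W such that X @ W approximates Y.

    X: (n, d) source-language embeddings for seed dictionary pairs.
    Y: (n, d) target-language embeddings for the same pairs.
    Returns the least-squares solution W of shape (d, d).
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Toy illustration with random data standing in for real embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                       # "source" vectors
W_true = rng.normal(size=(50, 50))                    # hidden ground-truth map
Y = X @ W_true + 0.01 * rng.normal(size=(1000, 50))   # noisy "target" vectors

W = fit_linear_clwe_map(X, Y)
print(np.allclose(W, W_true, atol=0.05))  # True: the map is recovered up to noise
```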
On the Security Vulnerabilities of Text-to-SQL Models
Recent studies show that, despite being effective on numerous tasks, text processing algorithms may be vulnerable to deliberate attacks. However, the question of whether such weaknesses can directly lead to security threats is still under-explored. To bridge this gap, we conducted vulnerability tests on Text-to-SQL, a technique that builds natural language interfaces for databases. Empirically, we showed that the Text-to-SQL modules of two commercial black boxes (Baidu-UNIT and Codex-powered Ai2sql) can be manipulated to produce malicious code, potentially leading to data breaches and Denial of Service. This is the first demonstration of the danger of NLP models being exploited as attack vectors in the wild. Moreover, experiments involving four open-source frameworks verified that simple backdoor attacks can achieve a 100% success rate on Text-to-SQL systems with almost no impact on prediction performance. By reporting these findings and suggesting practical defences, we call for immediate attention from the NLP community to the identification and remediation of software security issues.
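The abstract mentions practical defences without detailing them here; one generic, deployment-side mitigation (our illustration, not necessarily the authors' proposal) is to refuse to execute any model-generated SQL that is not a single read-only statement:

```python
import sqlparse  # third-party SQL parser: pip install sqlparse

ALLOWED_STATEMENTS = {"SELECT"}  # read-only interface: reject everything else

def is_safe_query(sql: str) -> bool:
    """Conservatively reject model-generated SQL that is not a single SELECT."""
    statements = sqlparse.parse(sql)
    if len(statements) != 1:          # multi-statement payloads are a red flag
        return False
    return statements[0].get_type() in ALLOWED_STATEMENTS

print(is_safe_query("SELECT name FROM users WHERE id = 42"))  # True
print(is_safe_query("SELECT 1; DROP TABLE users"))            # False
```

An allow-list is deliberately conservative: anything the parser cannot positively classify as a lone SELECT is rejected, which blocks both multi-statement payloads and destructive commands.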
Modeling relation paths for knowledge base completion via joint adversarial training
Knowledge Base Completion (KBC), which aims at determining the missing relations between entity pairs, has received increasing attention in recent years. Most existing KBC methods focus on either embedding the Knowledge Base (KB) into a specific semantic space or leveraging the joint probability of Random Walks (RWs) on multi-hop paths. Only a few unified models adequately take both semantic and path-related features into consideration. In this paper, we propose a novel method to explore the intrinsic relationship between the single relation (i.e., 1-hop path) and multi-hop paths between paired entities. We use Hierarchical Attention Networks (HANs) to select important relations in multi-hop paths and encode them into low-dimensional vectors. By treating relations and multi-hop paths as two different input sources, we use a feature extractor, shared by two downstream components (a relation classifier and a source discriminator), to capture the information they have in common. Through joint adversarial training, we encourage the model to extract features from multi-hop paths that are representative for relation completion. We apply the trained model (excluding the source discriminator) to several large-scale KBs for relation completion. Experimental results show that our method outperforms existing path-information-based approaches. Since each sub-module of our model is well interpretable, it can be applied to a large number of relation learning tasks.
Comment: Accepted by Knowledge-Based Systems
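The abstract does not spell out the adversarial mechanics, but a common way to realise a shared extractor trained against a source discriminator is a gradient-reversal layer, as in domain-adversarial training. The PyTorch sketch below is our illustration under that assumption; the layer sizes and the simple linear encoder stand in for the paper's HAN components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()

D_IN, D_HID, N_REL = 100, 64, 237   # toy sizes, not taken from the paper

shared_extractor = nn.Sequential(nn.Linear(D_IN, D_HID), nn.ReLU())
relation_classifier = nn.Linear(D_HID, N_REL)
source_discriminator = nn.Linear(D_HID, 2)  # 0 = 1-hop relation, 1 = multi-hop path

def joint_loss(x, relation_label, source_label):
    h = shared_extractor(x)
    cls_loss = F.cross_entropy(relation_classifier(h), relation_label)
    # Reversed gradients train the extractor to *fool* the discriminator,
    # pulling 1-hop and multi-hop features into a shared space.
    disc_loss = F.cross_entropy(
        source_discriminator(GradReverse.apply(h)), source_label)
    return cls_loss + disc_loss

# Usage with dummy data:
x = torch.randn(8, D_IN)
loss = joint_loss(x, torch.randint(N_REL, (8,)), torch.randint(2, (8,)))
loss.backward()
```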
Revisiting the linearity in cross-lingual embedding mappings: from a perspective of word analogies
Most cross-lingual embedding mapping algorithms assume the optimised transformation functions to be linear. Recent studies have shown that, on some occasions, learning a linear mapping does not work, indicating that this commonly-used assumption may fail. However, it remains unclear under which conditions the linearity of cross-lingual embedding mappings holds. In this paper, we rigorously explain that the linearity assumption relies on the consistency of analogical relations encoded by multilingual embeddings, and we conduct extensive experiments to validate this claim. Empirical results on an analogy completion benchmark and the Bilingual Lexicon Induction (BLI) task demonstrate a strong correlation between whether a mapping captures analogical information and whether it is linear.
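The claimed link is easy to see mechanically: a linear map W satisfies W(a - b) = Wa - Wb, so offset-based analogies that hold in the source space are carried over exactly. A tiny numerical check with made-up vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))                  # an arbitrary linear map

# An "analogy" in the source space: king - man + woman = queen (by construction).
man, woman, king = (rng.normal(size=4) for _ in range(3))
queen = king - man + woman

# Linearity preserves the offset exactly: W @ (king - man + woman) = W @ queen.
lhs = W @ king - W @ man + W @ woman
rhs = W @ queen
print(np.allclose(lhs, rhs))  # True
```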
Highly Efficient Knowledge Graph Embedding Learning with Orthogonal Procrustes Analysis
Knowledge Graph Embeddings (KGEs) have been intensively explored in recent years due to their promise for a wide range of applications. However, existing studies focus on improving the final model performance without acknowledging the computational cost of the proposed approaches, in terms of execution time and environmental impact. This paper proposes a simple yet effective KGE framework which can reduce the training time and carbon footprint by orders of magnitude compared with state-of-the-art approaches, while producing competitive performance. We highlight three technical innovations: full batch learning via relational matrices, closed-form Orthogonal Procrustes Analysis for KGEs, and non-negative-sampling training. In addition, as the first KGE method whose entity embeddings also store full relation information, our trained models encode rich semantics and are highly interpretable. Comprehensive experiments and ablation studies involving 13 strong baselines and two standard datasets verify the effectiveness and efficiency of our algorithm.
Comment: To appear at NAACL 2021
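For context, the closed-form step named above is the classical Orthogonal Procrustes problem: the orthogonal W minimising ||XW - Y||_F is W = UV^T, where U S V^T is the SVD of X^T Y. A minimal self-contained sketch (toy data, our variable names):

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Closed-form orthogonal map W minimising ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: recover a random rotation from noiseless data.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))   # a random orthogonal matrix
X = rng.normal(size=(200, 50))
W = orthogonal_procrustes(X, X @ Q)
print(np.allclose(W, Q))  # True
```

Because the solution is a single SVD rather than an iterative optimisation, each update is cheap and deterministic, which is what enables the training-time and carbon-footprint savings the abstract describes.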
Summarising Historical Text in Modern Languages
We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language. This is a fundamentally important routine for historians and digital humanities researchers, but it has never been automated. We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese. Based on cross-lingual transfer learning techniques, we propose a summarisation model that can be trained even with no cross-lingual (historical-to-modern) parallel data, and we benchmark it against state-of-the-art algorithms. We report automatic and human evaluations that distinguish the historical-to-modern summarisation task from standard cross-lingual summarisation (i.e., modern to modern language), highlight the distinctness and value of our dataset, and demonstrate that our transfer learning approach outperforms standard cross-lingual benchmarks on this task.
Comment: To appear at EACL 2021